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Abstract. We are thankful to the discussants for their hard, interesting 
work. The main purpose of our paper was to give reasonably sharp rates 
of convergence for some simple examples of the Gibbs sampler. We 
chose examples from expository accounts where direct use of available 
techniques gave practically useless answers. Careful treatment of these 
simple examples grew into bivariate modeling and Lancaster families. 
Since bounding rates of convergence is our primary focus, let us begin 
there. 



1. RATES OF TWO COMPONENT 
GIBBS SAMPLERS 

Consider the beta/binomial example (with a uni- 
form prior) discussed in our introduction. Some of 
our students tried to use the Harris recurrence tech- 
niques directly on the two component chain. The 
two-component chain K goes {x, 6) (x, 6') (x', 9'] 
To establish the drift condition: E{V{Xi^i) \ Xi = 
x) < XV{x) + 6, they chose V{x, 9) = x. Then E{x' \ 
{x,9)) = ^ + ^, so A > ^,6 = ^ work. For 
the minorization condition, they used the factoriza- 
tion 



infx/il^* \ x), e = jQg{9)d9 = ^ and q{x,9) ■- 
9{(^)f 2ix I 9). Then the minorization condition 

/((x', 9') I (x, 9)) > eq{x, 9) Vx, x', 9, 9' 

is satisfied. This leads to the bound 



/II 



TV 



with 



1 + d 



a 



f{{x\9')\{x,9)) = h{9' \ x)f2{x' \ 9') 

with I x) the Beta(x -|- 1, n — x -|- 1) density and 
/2(- I 9') the Binomial(n, density. Let g{9) = 



l + 2b + \d' 



l + 2{\d + b), d> 



2b 
1-X' 
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For n = 100 and xq G (0, \), they chose d = 1000, 
r = Y(|)Q . The bound says that if we run the sampler 
10^^ steps, the total variation distance will be less 
than 0.01. 

A similarly poor rate follows from Proposition and 
Example 4.1.1 of Berti et al. The point of spelling 
out this example is not to make fun of anyone, but 
to emphasize how a reasonable first pass at using 
off the shelf tools can lead to a useless answer. Here, 
despite the fact that an explicit eigenfunction corre- 
sponding to the largest eigenvalue was available as 
a choice for the drift function V . 

We are impressed and thankful to Berti et al. 
and Jones and Johnson for carrying out the work 
to massage their bounds into a more useable form. 
We regard the treatment of the normal example in 
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Jones and Johnson as particularly successful (we 
don't see any practical difference between "3 steps" 
and "99 steps"). The reader who studies their ar- 
gument will find clever, nonobvious choices coupled 
with computer work. Proposition 3 in Berti et al. 
also seems quite useful. They say they are play- 
ing "devils advocate." We note that "the devil is 
in the details." We have tried their suggestion to 
use "numerical evaluations" to make a choice of d, r 
and B from their Proposition 3 for their example 
4.2.1. After some playing around, the best we found 
is d = A,r = 0.11, B = (2, 3). Putting this into their 
bound gives 

||/(x,-)-P||tv 
< (0.99986)^ + (0.998497)^(2 + x). 

For X = 0,6 = 1, this gives that 34,000 steps are 
required to make total variation distance smaller 
than 0.01. Our bounds show that seven steps are re- 
quired. We are trying to use their ideas for the three 
component example with joint density f{j,6,n) = 
{j)e^{l - ^)"~^'^^. It does not seem easy. 

2. TWO MORE FOCUSED RESPONSES 

Jones and Johnson suggest that "the existence of 
a CLT is asymptotic." We disagree. In situations 
such as the present one, with control on the spectral 
gap, there are non-asymptotic Berry-Esseen results 
for additive functions of Markov chains. Useful ref- 
erences are Mann [15] or Lezaud [14]. In situations 
where one has explicit constants for geometric er- 
godicity, the work of Kontoyanis and Meyn [11, 12] 
seems quite explicit. 

At the end of their comment Berti et al. suggest 
that the conditionally reducible families of Consonni 
and Veronese [5] may be amenable to our explicit 
analytical techniques. This has recently been pushed 
through for multinomial, multivariate normal and 
other examples in Khare and Zhou [10]. 

3. LANCASTER FAMILIES 

Gerard Letac has given a masterful summary of 
this part of our paper along with several new exam- 
ples. As anyone who studies this subject learns, fas- 
cinating new examples "pop out of nowhere." While 
our paper was being edited, a large new class of pro- 
cesses with polynomial eigenfunctions surfaced in 
the work of Bryc, Matysiak and Wesolowski [4]. Or- 
thogonal polynomials have a hierarchy; lower ones 



(e.g., Hermite) being limits of higher ones (e.g., La- 
guerre, Charlier). At the top of the list are the Askey- 
Wilson polynomials. These have yet to appear in 
natural probability problems. Just below them are 
the very similar Al-Salem Chihara family. These are 
central to the work of Bryc et al. 

Letac rightly points out that our location fami- 
lies are a subclass of models suggested by Eagleson. 
We would like to point out a strange anthropological 
feature of this part of the world. In this age of "com- 
putational statistic," the kind of distribution theory 
that Letac (and we) enjoy so much is sometimes re- 
garded as an old fashioned corner of statistics. We 
recently fielded a question from Susan Holmes who 
had trivariate count data (668 patients with counts 
of number of mutations in three regions of each pa- 
tient's HIV strain). The data had Poisson margins 
and curious correlations. Because we knew of Ea- 
gleson's work and its extensions by Letac [13] and 
Griffiths et al. [8], along with practical implementa- 
tion by Karlis and Meligkotsidou [9] , we were able to 
suggest simple models which made good sense (and 
matched the data). The old fashioned corner shone 
brightly, at least for a moment. See Rhee et al. [16]. 

We would like to add one thought to Letac's list 
of three. We regard one of our major contibutions 
as the use of Lancaster families for explicit determi- 
nation of rates of convergence of the Gibbs sampler. 
This allows us to harness years of work by Letac 
and his students along with the wonderful tools de- 
veloped by the orthogonal polynomial community to 
answer simple interesting questions in mathematical 
statistics. 

4. SCANNING STRATEGIES 

George Casella and Richard Levine bring a fresh 
perspective, useful literature and great new ques- 
tions. We continue the discussion in two directions. 

4.1 Diagonalization for Non-Uniform 
Scan Strategies 

It is not necessary to use uniform coordinate choices 
to diagonalize our random scan samplers. Suppose 
that we choose the x-coordinate with probability a 
and the 9 coordinate with probability 1 — a. Using 
the setup from Section 3, the corresponding random 
scan operator Ka on L^{P) is given by 

Kag{x,0)=a[ g{x,6)7r{9' \x)TT{d9') 
Je 

+ {l-a) f g{x',9)fe{x')ix{dx') 
Jx 
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ygeL\P). 

For < A; < c, consider Ka acting on pk{x) + uqk{0) 
where u satisfies 



(1) au(l + ;Ufcti) = (1 - a)(% + 

The result is 
a{pk{x) + Ex[qk{0')]u) 

+ {l-a){Ee[pk{x)]+uqk{e)) 
= a(l + f^ku)Pk{x) + (1 - a){'i]k + u)qk{6) 
(1 - a){r]k + u 



a{l + fXku) 



Pk{x) + 



a{l + fiku) 
a(l + fj,ku)[pk{x) + uqk{0)]. 



qk{e) 



The last equality follows from (1). If c < oo, then for 
k > c, Lemma A2 (Appendix) shows that 
Ex[Qk{(^')] = for all x. It follows that is diago- 
nalizable with eigenvalues/eigenfunctions 

1 ± v/(l - 2a)^ + 4a(l - a)fikVk 



Pk{x) + 



(1 - 2a) ± V(l-2a)^ + 4a(l-a)^fc% 



■qkiO) forO</c<c, 
1 — a, gA:(0) for c < /c < oo. 
In particular, the spectral gap is 



1 - ^(1 - 2a)^ + 4a(l - a)^ir/i 
2 ' 

Clearly, the spectral gap is maximized when a = ^ • 
Hence, if we choose spectral gap as a measure of 
convergence optimality, then uniformly choosing co- 
ordinates is the best way. However, as we point out 
later on, spectral gap is not always the most accu- 
rate criterion for measuring convergence optimality, 
and convergence of Markov chains often depends on 
more subtle notions. 

4.2 Comparison with Systematic Scan Strategies 

The class of systematic scan strategies is frequently 
used in practice and just "seems sensible." It is hard 
to prove things, or compare to random scan strate- 
gies, especially for high-dimensional problems, be- 
cause the systematic scan chains become quite non- 
local. However, for our examples it is not difficult 
to analyze the random scan chain. We give the de- 
tails for the beta-binomial case (uniform prior), but 
mostly everything applies to all of the examples. 



Let K [defined in (2.3)] be the operator corre- 
sponding to the random scan chain. ThenK = ^{Pi + 
P2), where 

Pig{x,9)= [ g{x,e')7T{9'\x)7r{de') "^g^L^P) 
Je 

and 

P2g{x,e) 

g{x',e)f{x'\e)m{dx') ygeL\P), 

X 

are the projection operators onto L^{m) and L^(7r), 
respectively. 

Proposition 1. For the beta/binomial random 
scan chain (uniform prior), 



/I|tv>^(1 



n + 2 



\/n>l,9> -,£>1 
- ' - 2' - 



and 



/II 



TV 



l/2\ l-l 



n + 2/1 1/ 2 
+ ">i — {2+-2V-—2 

1 3n 

yn>i,e>-,£>—. 

Remark. Note that ^ + ^(1-;^)^/^ = l-;iT2 + 
O(^). For the systematic scan Gibbs samplers, the 
distance after i steps is roughly (up to small ex- 
plicit multiplicative constants) (1 — :p^Y- Hence in 
this sense, the random scan chain takes twice the 
amount of time as the systematic scan chain to con- 
verge to the stationary distribution. Although one 
might argue that computationally one step of the 
systematic scan chains is comparable to two steps 
of the random scan chain, hence they are equivalent 
computationally. 

Proof of Proposition 1. The function (/)(x, 
9) = [x — ^) + y^n{n + 2){6 — ^) is an eigenfunction 

corresponding to the eigenvalue ^ -|- ^^J^^ for K. 

Using the bound of Lemma 2.1 we have 



IK" 



,, 1/1 1 r~n~ 
^"^^-3l2 + 2V^ 



1 



n + 2 
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yn>l,0>-,£>l. 

This shows that i must be of order n to have a 
chance of total variation convergence. We next show 
that this order of steps suffice. 

For the upper bound, we expand K'^ using the 
binomial theorem and use the fact that Pi,P2 are 
projections, so repeated terms cancel out. Thus, 

i?3 = (l(Pi+P2)f 

= l{{Pl+2PiP2 + PlP2Pl) 

+ {P2 + 2P2Pl+P2PlP2)}. 

Let K and K be the systematic^can operators de- 
fined in Section 2.1. Note that K = P1P2 and K = 
P2P1. For £ > it follows from the work done in 



16 ' 



Section 4 that Vn > 1, ^ > i. 



|i^;;,e-/llTV< 10(^1 

|<«-/||tv<io('i 



n + 2 
2 



£-1/2 



n + 2, 

The number of terms in^the binomial expansion of 
which collapse to or is easily seen to 
be (2^1^)5 1 ^ J < 5- The number of terms on the 

binomial expansion which collapse to Pi or P2 
is easily seen to be (^2})^^ ^ J < |- Note that: 

1. II • IItv is convex and || • ||tv ^ 1- 

2. By Azuma's inequality, < e"'/^ V/ > 1. 

3. IK^^PiW- /IItv <||^le- /IItv. 

4. ||(i^^P2ke- /IItv <||i^;i,e- /IItv. 

Using the facts above and the binomial expansion of 



(i(Pi + P2))', it follows that for I > 



3n 
4 ' 



\K' 



/II 



TV 



<3e(^-i)/8 



n 



j=e/4+i 



2'- 



<3e 



-{i-l)/8 



^ ^^^/n+T^(,tl)(Vn/(n+2)r^ 



n 



/n + 2 /I 1, 
+ ^0^/^ 2 + 2'^ 



2 y/2y-i 



Remark. The calculations can be carried out 
for the non-symmetric mixture aPi + (1 — a)P2 - The 
multipliers of the condensed terms K^,K^, Pi 
and K^P2 are now polynomials in a which can be 
given explicitly. The asymptotics of these multipli- 
ers are available using the distribution theory of the 
classical Wald-Wolfowitz runs test. We omit further 
details since in present examples the choice ct = \ 
seems best. 

In the handful of other cases where things can 
be proved, systematic scan chains are not superior 
to random scan chains. One nice example involves 
graph coloring. A natural algorithm is to scan through 
vertices and try a new color. If this results in a legit- 
imate coloring, the change is made, else the previous 
coloring is kept. Should one choose vertices at ran- 
dom or scan through systematically? Dyer et al. [7] 
managed to find classes of graphs where random and 
systematic scans are comparable. 

Similar results are found for a natural statisti- 
cal problem involving a non-uniform distribution on 
permutations (Mallows model through Kendall's tau). 
Diaconis and Ram [6] coupled with Benjamini et al. 
[2] found random pairwise transpositions followed 
by Metropolis comparable with systematically scan- 
ning through all coordinates. The Diaconis and Ram 
paper contains a literature review of scanning strate- 
gies. 

A very important point made by Casella and Levine 
is that asymptotic variance of a few statistics of in- 
terest gives an important alternative notion of con- 
vergence that can give different answers. This is an 
important research area. See Bassetti and Diaconis 
[1] for some first steps/tools. 

Finally, we note and mildly object to equating 
convergence rates with spectral gap. The present 
authors have spent much of their careers trying to 
make the point that practical convergence of Markov 
chains depends on much subtler notions. Consider 
the Poisson-Gamma example in our Proposition 4.4 
with a = a = 1 . The second eigenvalue of the x-chain 
is ^. If we just use this, we get the usual bound 



1 



|^j-"l||TV < 1/ 



with m{x) 



This bound implies that it takes £ of order j steps 
to randomize. Proposition 4.4 shows that I of order 
log j steps is the right answer. We may wonder why 
tuning behaviour to a criterion (like spectral gap) 
tangentially related to convergence is worthwhile. 
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Of course, we too have sinned in this direction; see 
[3]. 

In conclusion, we thank our discussants and edi- 
tors for their help, encouragement and good ideas. 
Thanks to Susan Holmes for the Poisson example 
and to Guoqiang Hu and Wai Wai Liu for help with 
the beta-binomial example. 
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